IruMozhi: Automatically classifying diglossia in Tamil
Tamil, a Dravidian language of South Asia, is a highly diglossic language
with two very different registers in everyday use: Literary Tamil (preferred in
writing and formal communication) and Spoken Tamil (confined to speech and
informal media). Spoken Tamil is under-supported in modern NLP systems. In this
paper, we release IruMozhi, a human-annotated dataset of parallel text in
Literary and Spoken Tamil. We train classifiers on the task of identifying
which variety a text belongs to. We use these models to gauge the availability
of pretraining data in Spoken Tamil, to audit the composition of existing
labelled datasets for Tamil, and to encourage future work on the variety.
Comment: 4 pages main text, 7 tota
SNACS Annotation of Case Markers and Adpositions in Hindi
We present in-progress annotation of semantic relations expressed through adpositions and case markers in a Hindi corpus. We used the multilingual SNACS annotation scheme, which has been applied to a variety of typologically diverse languages. We examine the annotation problems that arise in Hindi and use them to suggest changes to SNACS. We look towards finalizing the corpus and using it for future work in typology and semantic role-dependent tasks.
CGELBank Annotation Manual v1.0
CGELBank is a treebank and associated tools based on a syntactic formalism
for English derived from the Cambridge Grammar of the English Language. This
document lays out the particularities of the CGELBank annotation scheme.
Estimating the Entropy of Linguistic Distributions
Shannon entropy is often a quantity of interest to linguists studying the
communicative capacity of human language. However, entropy must typically be
estimated from observed data because researchers do not have access to the
underlying probability distribution that gives rise to these data. While
entropy estimation is a well-studied problem in other fields, there is not yet
a comprehensive exploration of the efficacy of entropy estimators for use with
linguistic data. In this work, we fill this void, studying the empirical
effectiveness of different entropy estimators for linguistic distributions. In
a replication of two recent information-theoretic linguistic studies, we find
evidence that the reported effect size is over-estimated due to over-reliance
on poor entropy estimators. Finally, we end our paper with concrete
recommendations for entropy estimation depending on distribution type and data
availability.
Comment: 21 pages (5 pages main text). 4 figures. Accepted to ACL 202
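To make the estimation problem concrete, here is a minimal sketch of two standard estimators: the plug-in (maximum-likelihood) estimator, which is known to underestimate entropy on small samples, and the Miller-Madow bias correction. The abstract does not say which estimators the paper evaluates or recommends, so these two are illustrative assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) estimate of Shannon entropy, in bits."""
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    # Treat relative frequencies as if they were the true probabilities.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction, in bits."""
    counts = Counter(samples)
    n = sum(counts.values())
    k = len(counts)  # number of distinct outcomes actually observed
    # The correction (k - 1) / (2n) is in nats; dividing by ln 2 converts to bits.
    return plugin_entropy(samples) + (k - 1) / (2 * n * math.log(2))


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the dog sat too".split()
    print(plugin_entropy(tokens))
    print(miller_madow_entropy(tokens))
```

The correction term shrinks as the sample grows, so the two estimates converge on large corpora and differ most where linguistic data is sparse.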
UniMorph 4.0: Universal Morphology
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle issues such as missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.